Name:
print('Hanun Masitha Ramadhani')
Hanun Masitha Ramadhani
(1) Perform univariate EDA for every numeric column in employee.csv, covering:
a. a histogram and a boxplot for each column
b. basic statistical metrics for each column: mean, std, min, q1, q2, q3, iqr, max
c. the upper whisker and lower whisker values from each column's boxplot
d. if there are outliers (< q1 - 1.5*iqr or > q3 + 1.5*iqr): the count, proportion, and list of outliers for each column
e. the skew metric and a skewtest for each column
f. anything you find interesting in the EDA results you obtain
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
pd_employee = pd.read_csv('employee.csv')
pd_employee
| | Unnamed: 0 | EmployeeNumber | Attrition | Age | BusinessTravel | DailyRate | Department | DistanceFromHome | Education | EducationField | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | Yes | 41 | Travel_Rarely | 1102 | Sales | 1 | 2 | Life Sciences | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 1 | 2 | No | 49 | Travel_Frequently | 279 | Research & Development | 8 | 1 | Life Sciences | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 2 | 3 | Yes | 37 | Travel_Rarely | 1373 | Research & Development | 2 | 2 | Other | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 3 | 4 | No | 33 | Travel_Frequently | 1392 | Research & Development | 3 | 4 | Life Sciences | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 4 | 5 | No | 27 | Travel_Rarely | 591 | Research & Development | 2 | 1 | Medical | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2935 | 2935 | 2936 | No | 36 | Travel_Frequently | 884 | Research & Development | 23 | 2 | Medical | ... | 3 | 80 | 1 | 17 | 3 | 3 | 5 | 2 | 0 | 3 |
| 2936 | 2936 | 2937 | No | 39 | Travel_Rarely | 613 | Research & Development | 6 | 1 | Medical | ... | 1 | 80 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 |
| 2937 | 2937 | 2938 | No | 27 | Travel_Rarely | 155 | Research & Development | 4 | 3 | Life Sciences | ... | 2 | 80 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 |
| 2938 | 2938 | 2939 | No | 49 | Travel_Frequently | 1023 | Sales | 2 | 3 | Medical | ... | 4 | 80 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 |
| 2939 | 2939 | 2940 | No | 34 | Travel_Rarely | 628 | Research & Development | 8 | 3 | Medical | ... | 1 | 80 | 0 | 6 | 3 | 4 | 4 | 3 | 1 | 2 |
2940 rows × 35 columns
pd_employee.columns
Index(['Unnamed: 0', 'EmployeeNumber', 'Attrition', 'Age', 'BusinessTravel',
'DailyRate', 'Department', 'DistanceFromHome', 'Education',
'EducationField', 'EnvironmentSatisfaction', 'Gender', 'HourlyRate',
'JobInvolvement', 'JobLevel', 'JobRole', 'JobSatisfaction',
'MaritalStatus', 'MonthlyIncome', 'MonthlyRate', 'NumCompaniesWorked',
'Over18', 'OverTime', 'PercentSalaryHike', 'PerformanceRating',
'RelationshipSatisfaction', 'StandardHours', 'StockOptionLevel',
'TotalWorkingYears', 'TrainingTimesLastYear', 'WorkLifeBalance',
'YearsAtCompany', 'YearsInCurrentRole', 'YearsSinceLastPromotion',
'YearsWithCurrManager'],
dtype='object')
employee_number = pd_employee.select_dtypes(include = 'number')
employee_number
| | Unnamed: 0 | EmployeeNumber | Age | DailyRate | DistanceFromHome | Education | EnvironmentSatisfaction | HourlyRate | JobInvolvement | JobLevel | ... | RelationshipSatisfaction | StandardHours | StockOptionLevel | TotalWorkingYears | TrainingTimesLastYear | WorkLifeBalance | YearsAtCompany | YearsInCurrentRole | YearsSinceLastPromotion | YearsWithCurrManager |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 41 | 1102 | 1 | 2 | 2 | 94 | 3 | 2 | ... | 1 | 80 | 0 | 8 | 0 | 1 | 6 | 4 | 0 | 5 |
| 1 | 1 | 2 | 49 | 279 | 8 | 1 | 3 | 61 | 2 | 2 | ... | 4 | 80 | 1 | 10 | 3 | 3 | 10 | 7 | 1 | 7 |
| 2 | 2 | 3 | 37 | 1373 | 2 | 2 | 4 | 92 | 2 | 1 | ... | 2 | 80 | 0 | 7 | 3 | 3 | 0 | 0 | 0 | 0 |
| 3 | 3 | 4 | 33 | 1392 | 3 | 4 | 4 | 56 | 3 | 1 | ... | 3 | 80 | 0 | 8 | 3 | 3 | 8 | 7 | 3 | 0 |
| 4 | 4 | 5 | 27 | 591 | 2 | 1 | 1 | 40 | 3 | 1 | ... | 4 | 80 | 1 | 6 | 3 | 3 | 2 | 2 | 2 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2935 | 2935 | 2936 | 36 | 884 | 23 | 2 | 3 | 41 | 4 | 2 | ... | 3 | 80 | 1 | 17 | 3 | 3 | 5 | 2 | 0 | 3 |
| 2936 | 2936 | 2937 | 39 | 613 | 6 | 1 | 4 | 42 | 2 | 3 | ... | 1 | 80 | 1 | 9 | 5 | 3 | 7 | 7 | 1 | 7 |
| 2937 | 2937 | 2938 | 27 | 155 | 4 | 3 | 2 | 87 | 4 | 2 | ... | 2 | 80 | 1 | 6 | 0 | 3 | 6 | 2 | 0 | 3 |
| 2938 | 2938 | 2939 | 49 | 1023 | 2 | 3 | 4 | 63 | 2 | 2 | ... | 4 | 80 | 0 | 17 | 3 | 2 | 9 | 6 | 0 | 8 |
| 2939 | 2939 | 2940 | 34 | 628 | 8 | 3 | 2 | 82 | 4 | 2 | ... | 1 | 80 | 0 | 6 | 3 | 4 | 4 | 3 | 1 | 2 |
2940 rows × 26 columns
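In the preview above, `Unnamed: 0` appears to mirror the row index and `StandardHours` shows 80 in every displayed row, so some numeric columns may carry little information for univariate EDA. The following is a minimal sketch, not part of the original notebook, of one way to flag such columns. The helper name `flag_uninformative` and the toy data are hypothetical and used only for illustration.

```python
import pandas as pd

def flag_uninformative(df: pd.DataFrame) -> dict:
    """Flag numeric columns that are constant or have one unique value per row."""
    numeric = df.select_dtypes(include="number")
    n_unique = numeric.nunique()
    return {
        # a single distinct value: std is 0 and every plot is degenerate
        "constant": n_unique[n_unique <= 1].index.tolist(),
        # as many distinct values as rows: behaves like an identifier
        "id_like": n_unique[n_unique == len(numeric)].index.tolist(),
    }

# Made-up rows standing in for employee.csv
toy = pd.DataFrame({
    "RowId": [0, 1, 2, 3],
    "StandardHours": [80, 80, 80, 80],
    "Age": [41, 49, 41, 33],
    "Department": ["Sales", "R&D", "R&D", "Sales"],
})
print(flag_uninformative(toy))
```

Whether to drop the flagged columns is a judgment call; the assignment asks for every numeric column, so this notebook keeps them all.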
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
for x in employee_number.columns:
    # boxplot on top, histogram below, sharing one x-axis
    fig, axs = plt.subplots(2, 1, sharex=True)
    # histogram with dashed red lines at q1, the median, and q3
    sns.histplot(employee_number[x], ax=axs[1])
    axs[1].axvline(np.percentile(employee_number[x], 25), c='red', linestyle='--')
    axs[1].axvline(np.median(employee_number[x]), c='red', linestyle='--')
    axs[1].axvline(np.percentile(employee_number[x], 75), c='red', linestyle='--')
    # boxplot; passing x= keeps it horizontal so it lines up with the histogram
    sns.boxplot(x=employee_number[x], ax=axs[0])
    plt.show()
for x in employee_number.columns:
    # statistic metrics
    print(x)
    print(employee_number[x].describe())
    print("\n")
| Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| Unnamed: 0 | 2940.000000 | 1469.500000 | 848.849221 | 0.000000 | 734.750000 | 1469.500000 | 2204.250000 | 2939.000000 |
| EmployeeNumber | 2940.000000 | 1470.500000 | 848.849221 | 1.000000 | 735.750000 | 1470.500000 | 2205.250000 | 2940.000000 |
| Age | 2940.000000 | 36.923810 | 9.133819 | 18.000000 | 30.000000 | 36.000000 | 43.000000 | 60.000000 |
| DailyRate | 2940.000000 | 802.485714 | 403.440447 | 102.000000 | 465.000000 | 802.000000 | 1157.000000 | 1499.000000 |
| DistanceFromHome | 2940.000000 | 9.192517 | 8.105485 | 1.000000 | 2.000000 | 7.000000 | 14.000000 | 29.000000 |
| Education | 2940.000000 | 2.912925 | 1.023991 | 1.000000 | 2.000000 | 3.000000 | 4.000000 | 5.000000 |
| EnvironmentSatisfaction | 2940.000000 | 2.721769 | 1.092896 | 1.000000 | 2.000000 | 3.000000 | 4.000000 | 4.000000 |
| HourlyRate | 2940.000000 | 65.891156 | 20.325969 | 30.000000 | 48.000000 | 66.000000 | 84.000000 | 100.000000 |
| JobInvolvement | 2940.000000 | 2.729932 | 0.711440 | 1.000000 | 2.000000 | 3.000000 | 3.000000 | 4.000000 |
| JobLevel | 2940.000000 | 2.063946 | 1.106752 | 1.000000 | 1.000000 | 2.000000 | 3.000000 | 5.000000 |
| JobSatisfaction | 2940.000000 | 2.728571 | 1.102658 | 1.000000 | 2.000000 | 3.000000 | 4.000000 | 4.000000 |
| MonthlyIncome | 2940.000000 | 6502.931293 | 4707.155770 | 1009.000000 | 2911.000000 | 4919.000000 | 8380.000000 | 19999.000000 |
| MonthlyRate | 2940.000000 | 14313.103401 | 7116.575021 | 2094.000000 | 8045.000000 | 14235.500000 | 20462.000000 | 26999.000000 |
| NumCompaniesWorked | 2940.000000 | 2.693197 | 2.497584 | 0.000000 | 1.000000 | 2.000000 | 4.000000 | 9.000000 |
| PercentSalaryHike | 2940.000000 | 15.209524 | 3.659315 | 11.000000 | 12.000000 | 14.000000 | 18.000000 | 25.000000 |
| PerformanceRating | 2940.000000 | 3.153741 | 0.360762 | 3.000000 | 3.000000 | 3.000000 | 3.000000 | 4.000000 |
| RelationshipSatisfaction | 2940.000000 | 2.712245 | 1.081025 | 1.000000 | 2.000000 | 3.000000 | 4.000000 | 4.000000 |
| StandardHours | 2940.0 | 80.0 | 0.0 | 80.0 | 80.0 | 80.0 | 80.0 | 80.0 |
| StockOptionLevel | 2940.000000 | 0.793878 | 0.851932 | 0.000000 | 0.000000 | 1.000000 | 1.000000 | 3.000000 |
| TotalWorkingYears | 2940.000000 | 11.279592 | 7.779458 | 0.000000 | 6.000000 | 10.000000 | 15.000000 | 40.000000 |
| TrainingTimesLastYear | 2940.000000 | 2.799320 | 1.289051 | 0.000000 | 2.000000 | 3.000000 | 3.000000 | 6.000000 |
| WorkLifeBalance | 2940.000000 | 2.761224 | 0.706356 | 1.000000 | 2.000000 | 3.000000 | 3.000000 | 4.000000 |
| YearsAtCompany | 2940.000000 | 7.008163 | 6.125483 | 0.000000 | 3.000000 | 5.000000 | 9.000000 | 40.000000 |
| YearsInCurrentRole | 2940.000000 | 4.229252 | 3.622521 | 0.000000 | 2.000000 | 3.000000 | 7.000000 | 18.000000 |
| YearsSinceLastPromotion | 2940.000000 | 2.187755 | 3.221882 | 0.000000 | 0.000000 | 1.000000 | 3.000000 | 15.000000 |
| YearsWithCurrManager | 2940.000000 | 4.123129 | 3.567529 | 0.000000 | 2.000000 | 3.000000 | 7.000000 | 17.000000 |
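Task (b) also asks for the IQR, which `describe()` does not report directly. As a minimal sketch, assuming the IQR is taken as q3 - q1 with pandas' default quantile interpolation, the requested metrics could be collected into one table. The helper name `basic_stats` and the toy data are hypothetical.

```python
import pandas as pd

def basic_stats(df: pd.DataFrame) -> pd.DataFrame:
    """One row per numeric column: mean, std, min, q1, q2, q3, iqr, max."""
    numeric = df.select_dtypes(include="number")
    stats = pd.DataFrame({
        "mean": numeric.mean(),
        "std": numeric.std(),
        "min": numeric.min(),
        "q1": numeric.quantile(0.25),
        "q2": numeric.quantile(0.50),
        "q3": numeric.quantile(0.75),
        "max": numeric.max(),
    })
    # interquartile range derived from the two quartiles
    stats["iqr"] = stats["q3"] - stats["q1"]
    return stats[["mean", "std", "min", "q1", "q2", "q3", "iqr", "max"]]

# Made-up values standing in for employee.csv
toy = pd.DataFrame({"Age": [20, 30, 40, 50, 60], "Dept": list("abcde")})
print(basic_stats(toy))
```

On the real data this would be called as `basic_stats(pd_employee)`, with the non-numeric columns filtered out inside the helper.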
for x in employee_number.columns:
    print(x)
    q1 = np.percentile(employee_number[x], 25)
    q3 = np.percentile(employee_number[x], 75)
    iqr = q3-q1
    # fences: values beyond these bounds count as outliers
    upperbound = q3+1.5*iqr
    lowerbound = q1-1.5*iqr
    # whiskers: the most extreme observed values that still lie inside the fences
    upperlimit = np.max(employee_number[x][employee_number[x]<=upperbound])
    lowerlimit = np.min(employee_number[x][employee_number[x]>=lowerbound])
    print('upperlimit: {}'.format(upperlimit))
    print('lowerlimit: {}'.format(lowerlimit))
    outlier = employee_number[x][(employee_number[x]>upperbound) | (employee_number[x]<lowerbound)]
    print('Outlier:')
    print('Count : {}'.format(len(outlier)))
    print('Proportion : {}'.format(len(outlier)/len(employee_number[x])))
    print('List : {}'.format(list(outlier)))
    print("\n")
Unnamed: 0 upperlimit: 2939 lowerlimit: 0 Outlier: Count : 0 Proportion : 0.0 List : [] EmployeeNumber upperlimit: 2940 lowerlimit: 1 Outlier: Count : 0 Proportion : 0.0 List : [] Age upperlimit: 60 lowerlimit: 18 Outlier: Count : 0 Proportion : 0.0 List : [] DailyRate upperlimit: 1499 lowerlimit: 102 Outlier: Count : 0 Proportion : 0.0 List : [] DistanceFromHome upperlimit: 29 lowerlimit: 1 Outlier: Count : 0 Proportion : 0.0 List : [] Education upperlimit: 5 lowerlimit: 1 Outlier: Count : 0 Proportion : 0.0 List : [] EnvironmentSatisfaction upperlimit: 4 lowerlimit: 1 Outlier: Count : 0 Proportion : 0.0 List : [] HourlyRate upperlimit: 100 lowerlimit: 30 Outlier: Count : 0 Proportion : 0.0 List : [] JobInvolvement upperlimit: 4 lowerlimit: 1 Outlier: Count : 0 Proportion : 0.0 List : [] JobLevel upperlimit: 5 lowerlimit: 1 Outlier: Count : 0 Proportion : 0.0 List : [] JobSatisfaction upperlimit: 4 lowerlimit: 1 Outlier: Count : 0 Proportion : 0.0 List : [] MonthlyIncome upperlimit: 16555 lowerlimit: 1009 Outlier: Count : 228 Proportion : 0.07755102040816327 List : [19094, 18947, 19545, 18740, 18844, 18172, 17328, 16959, 19537, 17181, 19926, 19033, 18722, 19999, 16792, 19232, 19517, 19068, 19202, 19436, 16872, 19045, 19144, 17584, 18665, 17068, 19272, 18300, 16659, 19406, 19197, 19566, 18041, 17046, 17861, 16835, 16595, 19502, 18200, 16627, 19513, 19141, 19189, 16856, 19859, 18430, 17639, 16752, 19246, 17159, 17924, 17099, 17444, 17399, 19419, 18303, 19973, 19845, 17650, 19237, 19627, 16756, 17665, 16885, 17465, 19626, 19943, 18606, 17048, 17856, 19081, 17779, 19740, 18711, 18265, 18213, 18824, 18789, 19847, 19190, 18061, 17123, 16880, 17861, 19187, 19717, 16799, 17328, 19701, 17169, 16598, 17007, 16606, 19586, 19331, 19613, 17567, 19049, 19658, 17426, 17603, 16704, 19833, 19038, 19328, 19392, 19665, 16823, 17174, 17875, 19161, 19636, 19431, 18880, 19094, 18947, 19545, 18740, 18844, 18172, 17328, 16959, 19537, 17181, 19926, 19033, 18722, 19999, 16792, 19232, 
19517, 19068, 19202, 19436, 16872, 19045, 19144, 17584, 18665, 17068, 19272, 18300, 16659, 19406, 19197, 19566, 18041, 17046, 17861, 16835, 16595, 19502, 18200, 16627, 19513, 19141, 19189, 16856, 19859, 18430, 17639, 16752, 19246, 17159, 17924, 17099, 17444, 17399, 19419, 18303, 19973, 19845, 17650, 19237, 19627, 16756, 17665, 16885, 17465, 19626, 19943, 18606, 17048, 17856, 19081, 17779, 19740, 18711, 18265, 18213, 18824, 18789, 19847, 19190, 18061, 17123, 16880, 17861, 19187, 19717, 16799, 17328, 19701, 17169, 16598, 17007, 16606, 19586, 19331, 19613, 17567, 19049, 19658, 17426, 17603, 16704, 19833, 19038, 19328, 19392, 19665, 16823, 17174, 17875, 19161, 19636, 19431, 18880] MonthlyRate upperlimit: 26999 lowerlimit: 2094 Outlier: Count : 0 Proportion : 0.0 List : [] NumCompaniesWorked upperlimit: 8 lowerlimit: 0 Outlier: Count : 104 Proportion : 0.03537414965986395 List : [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9] PercentSalaryHike upperlimit: 25 lowerlimit: 11 Outlier: Count : 0 Proportion : 0.0 List : [] PerformanceRating upperlimit: 3 lowerlimit: 3 Outlier: Count : 452 Proportion : 0.15374149659863945 List : [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4] RelationshipSatisfaction upperlimit: 4 lowerlimit: 1 Outlier: Count : 0 Proportion : 0.0 List : [] StandardHours upperlimit: 80 lowerlimit: 80 Outlier: Count : 0 Proportion : 0.0 List : [] StockOptionLevel upperlimit: 2 lowerlimit: 0 Outlier: Count : 170 Proportion : 0.05782312925170068 List : [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] TotalWorkingYears upperlimit: 28 lowerlimit: 0 Outlier: Count : 126 Proportion : 0.04285714285714286 List : [31, 29, 37, 38, 30, 40, 36, 34, 32, 33, 37, 30, 36, 31, 33, 32, 37, 31, 32, 32, 30, 34, 30, 40, 29, 35, 31, 33, 31, 29, 32, 30, 33, 30, 29, 31, 32, 33, 36, 34, 31, 36, 33, 31, 29, 33, 29, 32, 31, 35, 29, 32, 34, 36, 32, 30, 36, 29, 34, 37, 29, 29, 35, 31, 29, 37, 38, 30, 40, 36, 34, 32, 33, 37, 30, 36, 31, 33, 32, 37, 31, 
32, 32, 30, 34, 30, 40, 29, 35, 31, 33, 31, 29, 32, 30, 33, 30, 29, 31, 32, 33, 36, 34, 31, 36, 33, 31, 29, 33, 29, 32, 31, 35, 29, 32, 34, 36, 32, 30, 36, 29, 34, 37, 29, 29, 35] TrainingTimesLastYear upperlimit: 4 lowerlimit: 1 Outlier: Count : 476 Proportion : 0.1619047619047619 List : [0, 5, 5, 5, 6, 5, 5, 5, 6, 6, 0, 0, 0, 5, 0, 5, 5, 5, 6, 6, 5, 0, 6, 5, 5, 0, 5, 5, 6, 5, 5, 5, 0, 5, 5, 5, 5, 6, 6, 5, 5, 5, 5, 0, 0, 5, 5, 5, 6, 6, 5, 0, 5, 0, 5, 5, 0, 6, 0, 5, 5, 6, 6, 5, 6, 5, 0, 5, 5, 5, 5, 0, 6, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5, 0, 5, 0, 5, 5, 6, 5, 6, 5, 0, 5, 5, 0, 6, 6, 5, 6, 0, 5, 0, 6, 6, 6, 6, 5, 5, 0, 5, 0, 0, 6, 0, 6, 5, 6, 5, 5, 0, 5, 6, 6, 5, 5, 0, 0, 6, 0, 0, 5, 0, 5, 6, 5, 5, 6, 6, 5, 5, 5, 5, 5, 6, 5, 6, 6, 0, 6, 6, 5, 5, 0, 0, 6, 6, 0, 5, 0, 0, 0, 0, 0, 5, 5, 6, 5, 5, 0, 5, 5, 0, 5, 5, 6, 5, 5, 5, 6, 5, 5, 5, 0, 0, 5, 5, 5, 5, 6, 0, 0, 6, 6, 6, 6, 5, 5, 5, 6, 5, 0, 5, 5, 6, 5, 6, 6, 5, 6, 6, 5, 0, 5, 5, 5, 5, 5, 0, 0, 0, 6, 5, 6, 6, 5, 6, 0, 6, 6, 5, 6, 6, 5, 5, 5, 0, 0, 5, 5, 5, 6, 5, 5, 5, 6, 6, 0, 0, 0, 5, 0, 5, 5, 5, 6, 6, 5, 0, 6, 5, 5, 0, 5, 5, 6, 5, 5, 5, 0, 5, 5, 5, 5, 6, 6, 5, 5, 5, 5, 0, 0, 5, 5, 5, 6, 6, 5, 0, 5, 0, 5, 5, 0, 6, 0, 5, 5, 6, 6, 5, 6, 5, 0, 5, 5, 5, 5, 0, 6, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5, 0, 5, 0, 5, 5, 6, 5, 6, 5, 0, 5, 5, 0, 6, 6, 5, 6, 0, 5, 0, 6, 6, 6, 6, 5, 5, 0, 5, 0, 0, 6, 0, 6, 5, 6, 5, 5, 0, 5, 6, 6, 5, 5, 0, 0, 6, 0, 0, 5, 0, 5, 6, 5, 5, 6, 6, 5, 5, 5, 5, 5, 6, 5, 6, 6, 0, 6, 6, 5, 5, 0, 0, 6, 6, 0, 5, 0, 0, 0, 0, 0, 5, 5, 6, 5, 5, 0, 5, 5, 0, 5, 5, 6, 5, 5, 5, 6, 5, 5, 5, 0, 0, 5, 5, 5, 5, 6, 0, 0, 6, 6, 6, 6, 5, 5, 5, 6, 5, 0, 5, 5, 6, 5, 6, 6, 5, 6, 6, 5, 0, 5, 5, 5, 5, 5, 0, 0, 0, 6, 5, 6, 6, 5, 6, 0, 6, 6, 5, 6, 6, 5, 5, 5, 0] WorkLifeBalance upperlimit: 4 lowerlimit: 1 Outlier: Count : 0 Proportion : 0.0 List : [] YearsAtCompany upperlimit: 18 lowerlimit: 0 Outlier: Count : 208 Proportion : 0.0707482993197279 List : [25, 22, 22, 27, 21, 22, 37, 25, 20, 40, 20, 24, 20, 24, 33, 20, 19, 22, 33, 24, 19, 
21, 20, 36, 20, 20, 22, 24, 21, 21, 25, 21, 29, 20, 27, 20, 31, 32, 20, 20, 21, 22, 22, 34, 24, 26, 31, 20, 31, 26, 19, 21, 21, 32, 21, 19, 20, 22, 20, 21, 26, 20, 22, 24, 33, 29, 25, 21, 19, 19, 20, 19, 33, 19, 19, 20, 20, 20, 20, 20, 32, 20, 21, 33, 36, 26, 30, 22, 23, 23, 21, 21, 22, 22, 19, 22, 19, 22, 20, 20, 20, 22, 20, 20, 25, 22, 22, 27, 21, 22, 37, 25, 20, 40, 20, 24, 20, 24, 33, 20, 19, 22, 33, 24, 19, 21, 20, 36, 20, 20, 22, 24, 21, 21, 25, 21, 29, 20, 27, 20, 31, 32, 20, 20, 21, 22, 22, 34, 24, 26, 31, 20, 31, 26, 19, 21, 21, 32, 21, 19, 20, 22, 20, 21, 26, 20, 22, 24, 33, 29, 25, 21, 19, 19, 20, 19, 33, 19, 19, 20, 20, 20, 20, 20, 32, 20, 21, 33, 36, 26, 30, 22, 23, 23, 21, 21, 22, 22, 19, 22, 19, 22, 20, 20, 20, 22, 20, 20] YearsInCurrentRole upperlimit: 14 lowerlimit: 0 Outlier: Count : 42 Proportion : 0.014285714285714285 List : [15, 16, 18, 15, 18, 17, 16, 15, 16, 15, 16, 16, 15, 16, 17, 15, 15, 15, 17, 17, 16, 15, 16, 18, 15, 18, 17, 16, 15, 16, 15, 16, 16, 15, 16, 17, 15, 15, 15, 17, 17, 16] YearsSinceLastPromotion upperlimit: 7 lowerlimit: 0 Outlier: Count : 214 Proportion : 0.0727891156462585 List : [8, 15, 8, 8, 9, 13, 12, 10, 11, 9, 12, 15, 15, 15, 9, 11, 11, 9, 12, 11, 15, 11, 10, 9, 11, 9, 8, 11, 11, 8, 13, 9, 9, 12, 10, 11, 15, 13, 9, 11, 10, 8, 8, 11, 9, 11, 12, 11, 14, 13, 14, 8, 11, 15, 10, 11, 11, 15, 11, 13, 11, 13, 15, 8, 13, 15, 11, 14, 15, 15, 9, 11, 9, 8, 9, 15, 11, 12, 9, 8, 10, 14, 8, 13, 13, 12, 14, 8, 8, 8, 14, 14, 8, 12, 13, 14, 14, 12, 11, 8, 11, 9, 12, 8, 9, 11, 9, 8, 15, 8, 8, 9, 13, 12, 10, 11, 9, 12, 15, 15, 15, 9, 11, 11, 9, 12, 11, 15, 11, 10, 9, 11, 9, 8, 11, 11, 8, 13, 9, 9, 12, 10, 11, 15, 13, 9, 11, 10, 8, 8, 11, 9, 11, 12, 11, 14, 13, 14, 8, 11, 15, 10, 11, 11, 15, 11, 13, 11, 13, 15, 8, 13, 15, 11, 14, 15, 15, 9, 11, 9, 8, 9, 15, 11, 12, 9, 8, 10, 14, 8, 13, 13, 12, 14, 8, 8, 8, 14, 14, 8, 12, 13, 14, 14, 12, 11, 8, 11, 9, 12, 8, 9, 11, 9] YearsWithCurrManager upperlimit: 14 lowerlimit: 0 Outlier: Count : 28 
Proportion : 0.009523809523809525 List : [17, 15, 15, 15, 15, 17, 16, 17, 15, 17, 17, 17, 17, 16, 17, 15, 15, 15, 15, 17, 16, 17, 15, 17, 17, 17, 17, 16]
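In the output above, PerformanceRating has upperlimit and lowerlimit both equal to 3 and 452 outliers that are all 4. This follows from q1 = q3 = 3, so the IQR is 0 and the 1.5*IQR fences collapse onto a single value. The sketch below reproduces that effect on made-up data. The helper name `tukey_summary` is hypothetical, and pandas' default quantile interpolation is assumed.

```python
import pandas as pd

def tukey_summary(s: pd.Series) -> dict:
    """Whiskers and outliers under the 1.5*IQR rule for one numeric column."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower_bound, upper_bound = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = s[(s >= lower_bound) & (s <= upper_bound)]
    outliers = s[(s < lower_bound) | (s > upper_bound)]
    return {
        "iqr": iqr,
        "lower_whisker": inside.min(),
        "upper_whisker": inside.max(),
        "outlier_count": len(outliers),
        "outlier_proportion": len(outliers) / len(s),
        "outliers": outliers.tolist(),
    }

# Made-up ratings: mostly 3 with a few 4s, so q1 == q3 and the IQR is 0
ratings = pd.Series([3] * 8 + [4] * 2)
summary = tukey_summary(ratings)
print(summary)
```

For low-cardinality rating columns like this, the flagged values may simply be the less common category, so the outlier count is worth reading alongside the histogram.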
from scipy.stats import skew
from scipy.stats import skewtest
for x in employee_number.columns:
    print(x)
    print('skew:{}'.format(skew(employee_number[x])))
    print('skewtest:{}'.format(skewtest(employee_number[x])))
    print('\n')
| Column | skew | skewtest statistic | skewtest pvalue |
|---|---|---|---|
| Unnamed: 0 | 0.0 | 1.0010078382403174 | 0.3168230189432971 |
| EmployeeNumber | 0.0 | 1.0010078382403174 | 0.3168230189432971 |
| Age | 0.4128644615478507 | 8.813374700649044 | 1.2143382069378949e-18 |
| DailyRate | -0.0035149769582910268 | -0.07800437368682583 | 0.9378245738968809 |
| DistanceFromHome | 0.9571400469829039 | 18.1014778752387 | 3.1023107819184016e-73 |
| Education | -0.2893854052028824 | -6.295603942149894 | 3.0620529158836935e-10 |
| EnvironmentSatisfaction | -0.3213261358382832 | -6.959811236067978 | 3.4072896924386745e-12 |
| HourlyRate | -0.03227797319055414 | -0.7161299941994642 | 0.4739110846945963 |
| JobInvolvement | -0.49791062862696706 | -10.46301247154075 | 1.2772889142432348e-25 |
| JobLevel | 1.0243546583925869 | 19.052364536134256 | 6.280041557060713e-81 |
| JobSatisfaction | -0.3293354633089524 | -7.125015974617371 | 1.0406861736329698e-12 |
| MonthlyIncome | 1.3684185123330814 | 23.378240978471744 | 7.115689740525391e-121 |
| MonthlyRate | 0.01855884556846041 | 0.41182400336723884 | 0.6804684265009102 |
| NumCompaniesWorked | 1.0254233954371303 | 19.067177504506972 | 4.731639919818368e-81 |
| PercentSalaryHike | 0.8202898522796265 | 16.041863179092097 | 6.516948961970625e-58 |
| PerformanceRating | 1.9199210412109473 | 28.867389841625155 | 3.0656926196297327e-183 |
| RelationshipSatisfaction | -0.3025184698222079 | -6.569728847783072 | 5.0406968700948565e-11 |
| StandardHours | 0.0 | 1.0010078382403174 | 0.3168230189432971 |
| StockOptionLevel | 0.9679912809556102 | 18.257598212642677 | 1.8003843683669165e-74 |
| TotalWorkingYears | 1.11603155825941 | 20.289610312294126 | 1.5884922366175207e-91 |
| TrainingTimesLastYear | 0.5525595985771928 | 11.48409559793102 | 1.585862990202731e-30 |
| WorkLifeBalance | -0.5519163838185224 | -11.472257668483971 | 1.8185197151995864e-30 |
| YearsAtCompany | 1.7627284034822992 | 27.448254444670862 | 7.289045104875861e-166 |
| YearsInCurrentRole | 0.9164268059808774 | 17.50652076391757 | 1.2776855008203247e-68 |
| YearsSinceLastPromotion | 1.9822646234628944 | 29.40325736147359 | 4.989550260904439e-190 |
| YearsWithCurrManager | 0.832600290620938 | 16.23423821365354 | 2.8881104371329895e-59 |
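To read these results, the sign of the skew gives the direction of the longer tail, and the skewtest p-value indicates whether the departure from symmetry is statistically significant. The sketch below shows one way to combine the two. The 0.05 threshold, the helper name `describe_skew`, and the synthetic samples are assumptions for illustration, not part of the original notebook.

```python
import numpy as np
from scipy.stats import skew, skewtest

def describe_skew(values, alpha=0.05):
    """Return (skewness, p-value, label) for a 1-D sample."""
    g1 = skew(values)
    p = skewtest(values).pvalue
    if p >= alpha:
        label = "no significant skew"
    elif g1 > 0:
        label = "right-skewed"
    else:
        label = "left-skewed"
    return g1, p, label

rng = np.random.default_rng(0)
right_tailed = rng.exponential(scale=1.0, size=500)  # long right tail
symmetric = np.linspace(-1.0, 1.0, 101)              # symmetric by construction

for name, sample in [("right_tailed", right_tailed), ("symmetric", symmetric)]:
    g1, p, label = describe_skew(sample)
    print(f"{name}: skew={g1:.3f}, p={p:.3g} -> {label}")
```

Applied to the results above with the same threshold, DailyRate, HourlyRate, and MonthlyRate would be labeled as having no significant skew, while columns such as MonthlyIncome and YearsAtCompany would be labeled right-skewed.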
for x in employee_number.columns:
    # histogram with dashed red lines at q1, the median, and q3
    fig, axs = plt.subplots(2, 1, sharex=True)
    sns.histplot(employee_number[x], ax=axs[1])
    axs[1].axvline(np.percentile(employee_number[x], 25), c='red', linestyle='--')
    axs[1].axvline(np.median(employee_number[x]), c='red', linestyle='--')
    axs[1].axvline(np.percentile(employee_number[x], 75), c='red', linestyle='--')
    # boxplot; passing x= keeps it horizontal so it lines up with the histogram
    sns.boxplot(x=employee_number[x], ax=axs[0])
    plt.show()
    # violinplot
    plt.figure(figsize=(6, 3))
    sns.violinplot(x=employee_number[x])
    plt.show()
    # statistic metrics
    print(employee_number[x].describe())
    q1 = np.percentile(employee_number[x], 25)
    q3 = np.percentile(employee_number[x], 75)
    iqr = q3-q1
    # fences for the 1.5*IQR outlier rule
    upperbound = q3+1.5*iqr
    lowerbound = q1-1.5*iqr
    # whiskers: the most extreme observed values inside the fences
    upperlimit = np.max(employee_number[x][employee_number[x]<=upperbound])
    lowerlimit = np.min(employee_number[x][employee_number[x]>=lowerbound])
    print('upperlimit: {}'.format(upperlimit))
    print('lowerlimit: {}'.format(lowerlimit))
    outlier = employee_number[x][(employee_number[x]>upperbound) | (employee_number[x]<lowerbound)]
    print('outlier: count: {} proportion: {} list: {}'.format(len(outlier), len(outlier)/len(employee_number[x]), list(outlier)))
    # skewness and its significance test
    print('skew:{}'.format(skew(employee_number[x])))
    print('skewtest:{}'.format(skewtest(employee_number[x])))
count 2940.000000 mean 1469.500000 std 848.849221 min 0.000000 25% 734.750000 50% 1469.500000 75% 2204.250000 max 2939.000000 Name: Unnamed: 0, dtype: float64 upperlimit: 2939 lowerlimit: 0 outlier: count: 0 proportion: 0.0 list: [] skew:0.0 skewtest:SkewtestResult(statistic=1.0010078382403174, pvalue=0.3168230189432971)
count 2940.000000 mean 1470.500000 std 848.849221 min 1.000000 25% 735.750000 50% 1470.500000 75% 2205.250000 max 2940.000000 Name: EmployeeNumber, dtype: float64 upperlimit: 2940 lowerlimit: 1 outlier: count: 0 proportion: 0.0 list: [] skew:0.0 skewtest:SkewtestResult(statistic=1.0010078382403174, pvalue=0.3168230189432971)
count 2940.000000 mean 36.923810 std 9.133819 min 18.000000 25% 30.000000 50% 36.000000 75% 43.000000 max 60.000000 Name: Age, dtype: float64 upperlimit: 60 lowerlimit: 18 outlier: count: 0 proportion: 0.0 list: [] skew:0.4128644615478507 skewtest:SkewtestResult(statistic=8.813374700649044, pvalue=1.2143382069378949e-18)
count 2940.000000 mean 802.485714 std 403.440447 min 102.000000 25% 465.000000 50% 802.000000 75% 1157.000000 max 1499.000000 Name: DailyRate, dtype: float64 upperlimit: 1499 lowerlimit: 102 outlier: count: 0 proportion: 0.0 list: [] skew:-0.0035149769582910268 skewtest:SkewtestResult(statistic=-0.07800437368682583, pvalue=0.9378245738968809)
count 2940.000000 mean 9.192517 std 8.105485 min 1.000000 25% 2.000000 50% 7.000000 75% 14.000000 max 29.000000 Name: DistanceFromHome, dtype: float64 upperlimit: 29 lowerlimit: 1 outlier: count: 0 proportion: 0.0 list: [] skew:0.9571400469829039 skewtest:SkewtestResult(statistic=18.1014778752387, pvalue=3.1023107819184016e-73)
count 2940.000000 mean 2.912925 std 1.023991 min 1.000000 25% 2.000000 50% 3.000000 75% 4.000000 max 5.000000 Name: Education, dtype: float64 upperlimit: 5 lowerlimit: 1 outlier: count: 0 proportion: 0.0 list: [] skew:-0.2893854052028824 skewtest:SkewtestResult(statistic=-6.295603942149894, pvalue=3.0620529158836935e-10)
count 2940.000000 mean 2.721769 std 1.092896 min 1.000000 25% 2.000000 50% 3.000000 75% 4.000000 max 4.000000 Name: EnvironmentSatisfaction, dtype: float64 upperlimit: 4 lowerlimit: 1 outlier: count: 0 proportion: 0.0 list: [] skew:-0.3213261358382832 skewtest:SkewtestResult(statistic=-6.959811236067978, pvalue=3.4072896924386745e-12)
count 2940.000000 mean 65.891156 std 20.325969 min 30.000000 25% 48.000000 50% 66.000000 75% 84.000000 max 100.000000 Name: HourlyRate, dtype: float64 upperlimit: 100 lowerlimit: 30 outlier: count: 0 proportion: 0.0 list: [] skew:-0.03227797319055414 skewtest:SkewtestResult(statistic=-0.7161299941994642, pvalue=0.4739110846945963)
count 2940.000000 mean 2.729932 std 0.711440 min 1.000000 25% 2.000000 50% 3.000000 75% 3.000000 max 4.000000 Name: JobInvolvement, dtype: float64 upperlimit: 4 lowerlimit: 1 outlier: count: 0 proportion: 0.0 list: [] skew:-0.49791062862696706 skewtest:SkewtestResult(statistic=-10.46301247154075, pvalue=1.2772889142432348e-25)
count 2940.000000 mean 2.063946 std 1.106752 min 1.000000 25% 1.000000 50% 2.000000 75% 3.000000 max 5.000000 Name: JobLevel, dtype: float64 upperlimit: 5 lowerlimit: 1 outlier: count: 0 proportion: 0.0 list: [] skew:1.0243546583925869 skewtest:SkewtestResult(statistic=19.052364536134256, pvalue=6.280041557060713e-81)
count 2940.000000 mean 2.728571 std 1.102658 min 1.000000 25% 2.000000 50% 3.000000 75% 4.000000 max 4.000000 Name: JobSatisfaction, dtype: float64 upperlimit: 4 lowerlimit: 1 outlier: count: 0 proportion: 0.0 list: [] skew:-0.3293354633089524 skewtest:SkewtestResult(statistic=-7.125015974617371, pvalue=1.0406861736329698e-12)
count 2940.000000 mean 6502.931293 std 4707.155770 min 1009.000000 25% 2911.000000 50% 4919.000000 75% 8380.000000 max 19999.000000 Name: MonthlyIncome, dtype: float64 upperlimit: 16555 lowerlimit: 1009 outlier: count: 228 proportion: 0.07755102040816327 list: [19094, 18947, 19545, 18740, 18844, 18172, 17328, 16959, 19537, 17181, 19926, 19033, 18722, 19999, 16792, 19232, 19517, 19068, 19202, 19436, 16872, 19045, 19144, 17584, 18665, 17068, 19272, 18300, 16659, 19406, 19197, 19566, 18041, 17046, 17861, 16835, 16595, 19502, 18200, 16627, 19513, 19141, 19189, 16856, 19859, 18430, 17639, 16752, 19246, 17159, 17924, 17099, 17444, 17399, 19419, 18303, 19973, 19845, 17650, 19237, 19627, 16756, 17665, 16885, 17465, 19626, 19943, 18606, 17048, 17856, 19081, 17779, 19740, 18711, 18265, 18213, 18824, 18789, 19847, 19190, 18061, 17123, 16880, 17861, 19187, 19717, 16799, 17328, 19701, 17169, 16598, 17007, 16606, 19586, 19331, 19613, 17567, 19049, 19658, 17426, 17603, 16704, 19833, 19038, 19328, 19392, 19665, 16823, 17174, 17875, 19161, 19636, 19431, 18880, 19094, 18947, 19545, 18740, 18844, 18172, 17328, 16959, 19537, 17181, 19926, 19033, 18722, 19999, 16792, 19232, 19517, 19068, 19202, 19436, 16872, 19045, 19144, 17584, 18665, 17068, 19272, 18300, 16659, 19406, 19197, 19566, 18041, 17046, 17861, 16835, 16595, 19502, 18200, 16627, 19513, 19141, 19189, 16856, 19859, 18430, 17639, 16752, 19246, 17159, 17924, 17099, 17444, 17399, 19419, 18303, 19973, 19845, 17650, 19237, 19627, 16756, 17665, 16885, 17465, 19626, 19943, 18606, 17048, 17856, 19081, 17779, 19740, 18711, 18265, 18213, 18824, 18789, 19847, 19190, 18061, 17123, 16880, 17861, 19187, 19717, 16799, 17328, 19701, 17169, 16598, 17007, 16606, 19586, 19331, 19613, 17567, 19049, 19658, 17426, 17603, 16704, 19833, 19038, 19328, 19392, 19665, 16823, 17174, 17875, 19161, 19636, 19431, 18880] skew:1.3684185123330814 skewtest:SkewtestResult(statistic=23.378240978471744, pvalue=7.115689740525391e-121)
count 2940.000000 mean 14313.103401 std 7116.575021 min 2094.000000 25% 8045.000000 50% 14235.500000 75% 20462.000000 max 26999.000000 Name: MonthlyRate, dtype: float64 upperlimit: 26999 lowerlimit: 2094 outlier: count: 0 proportion: 0.0 list: [] skew:0.01855884556846041 skewtest:SkewtestResult(statistic=0.41182400336723884, pvalue=0.6804684265009102)
count 2940.000000 mean 2.693197 std 2.497584 min 0.000000 25% 1.000000 50% 2.000000 75% 4.000000 max 9.000000 Name: NumCompaniesWorked, dtype: float64 upperlimit: 8 lowerlimit: 0 outlier: count: 104 proportion: 0.03537414965986395 list: [9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9] skew:1.0254233954371303 skewtest:SkewtestResult(statistic=19.067177504506972, pvalue=4.731639919818368e-81)
count 2940.000000 mean 15.209524 std 3.659315 min 11.000000 25% 12.000000 50% 14.000000 75% 18.000000 max 25.000000 Name: PercentSalaryHike, dtype: float64 upperlimit: 25 lowerlimit: 11 outlier: count: 0 proportion: 0.0 list: [] skew:0.8202898522796265 skewtest:SkewtestResult(statistic=16.041863179092097, pvalue=6.516948961970625e-58)
count 2940.000000 mean 3.153741 std 0.360762 min 3.000000 25% 3.000000 50% 3.000000 75% 3.000000 max 4.000000 Name: PerformanceRating, dtype: float64 upperlimit: 3 lowerlimit: 3 outlier: count: 452 proportion: 0.15374149659863945 list: [4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4] skew:1.9199210412109473 skewtest:SkewtestResult(statistic=28.867389841625155, pvalue=3.0656926196297327e-183)
count 2940.000000 mean 2.712245 std 1.081025 min 1.000000 25% 2.000000 50% 3.000000 75% 4.000000 max 4.000000 Name: RelationshipSatisfaction, dtype: float64 upperlimit: 4 lowerlimit: 1 outlier: count: 0 proportion: 0.0 list: [] skew:-0.3025184698222079 skewtest:SkewtestResult(statistic=-6.569728847783072, pvalue=5.0406968700948565e-11)
count 2940.0 mean 80.0 std 0.0 min 80.0 25% 80.0 50% 80.0 75% 80.0 max 80.0 Name: StandardHours, dtype: float64 upperlimit: 80 lowerlimit: 80 outlier: count: 0 proportion: 0.0 list: [] skew:0.0 skewtest:SkewtestResult(statistic=1.0010078382403174, pvalue=0.3168230189432971)
count 2940.000000 mean 0.793878 std 0.851932 min 0.000000 25% 0.000000 50% 1.000000 75% 1.000000 max 3.000000 Name: StockOptionLevel, dtype: float64 upperlimit: 2 lowerlimit: 0 outlier: count: 170 proportion: 0.05782312925170068 list: [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3] skew:0.9679912809556102 skewtest:SkewtestResult(statistic=18.257598212642677, pvalue=1.8003843683669165e-74)
count 2940.000000 mean 11.279592 std 7.779458 min 0.000000 25% 6.000000 50% 10.000000 75% 15.000000 max 40.000000 Name: TotalWorkingYears, dtype: float64 upperlimit: 28 lowerlimit: 0 outlier: count: 126 proportion: 0.04285714285714286 list: [31, 29, 37, 38, 30, 40, 36, 34, 32, 33, 37, 30, 36, 31, 33, 32, 37, 31, 32, 32, 30, 34, 30, 40, 29, 35, 31, 33, 31, 29, 32, 30, 33, 30, 29, 31, 32, 33, 36, 34, 31, 36, 33, 31, 29, 33, 29, 32, 31, 35, 29, 32, 34, 36, 32, 30, 36, 29, 34, 37, 29, 29, 35, 31, 29, 37, 38, 30, 40, 36, 34, 32, 33, 37, 30, 36, 31, 33, 32, 37, 31, 32, 32, 30, 34, 30, 40, 29, 35, 31, 33, 31, 29, 32, 30, 33, 30, 29, 31, 32, 33, 36, 34, 31, 36, 33, 31, 29, 33, 29, 32, 31, 35, 29, 32, 34, 36, 32, 30, 36, 29, 34, 37, 29, 29, 35] skew:1.11603155825941 skewtest:SkewtestResult(statistic=20.289610312294126, pvalue=1.5884922366175207e-91)
count 2940.000000 mean 2.799320 std 1.289051 min 0.000000 25% 2.000000 50% 3.000000 75% 3.000000 max 6.000000 Name: TrainingTimesLastYear, dtype: float64 upperlimit: 4 lowerlimit: 1 outlier: count: 476 proportion: 0.1619047619047619 list: [0, 5, 5, 5, 6, 5, 5, 5, 6, 6, 0, 0, 0, 5, 0, 5, 5, 5, 6, 6, 5, 0, 6, 5, 5, 0, 5, 5, 6, 5, 5, 5, 0, 5, 5, 5, 5, 6, 6, 5, 5, 5, 5, 0, 0, 5, 5, 5, 6, 6, 5, 0, 5, 0, 5, 5, 0, 6, 0, 5, 5, 6, 6, 5, 6, 5, 0, 5, 5, 5, 5, 0, 6, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5, 0, 5, 0, 5, 5, 6, 5, 6, 5, 0, 5, 5, 0, 6, 6, 5, 6, 0, 5, 0, 6, 6, 6, 6, 5, 5, 0, 5, 0, 0, 6, 0, 6, 5, 6, 5, 5, 0, 5, 6, 6, 5, 5, 0, 0, 6, 0, 0, 5, 0, 5, 6, 5, 5, 6, 6, 5, 5, 5, 5, 5, 6, 5, 6, 6, 0, 6, 6, 5, 5, 0, 0, 6, 6, 0, 5, 0, 0, 0, 0, 0, 5, 5, 6, 5, 5, 0, 5, 5, 0, 5, 5, 6, 5, 5, 5, 6, 5, 5, 5, 0, 0, 5, 5, 5, 5, 6, 0, 0, 6, 6, 6, 6, 5, 5, 5, 6, 5, 0, 5, 5, 6, 5, 6, 6, 5, 6, 6, 5, 0, 5, 5, 5, 5, 5, 0, 0, 0, 6, 5, 6, 6, 5, 6, 0, 6, 6, 5, 6, 6, 5, 5, 5, 0, 0, 5, 5, 5, 6, 5, 5, 5, 6, 6, 0, 0, 0, 5, 0, 5, 5, 5, 6, 6, 5, 0, 6, 5, 5, 0, 5, 5, 6, 5, 5, 5, 0, 5, 5, 5, 5, 6, 6, 5, 5, 5, 5, 0, 0, 5, 5, 5, 6, 6, 5, 0, 5, 0, 5, 5, 0, 6, 0, 5, 5, 6, 6, 5, 6, 5, 0, 5, 5, 5, 5, 0, 6, 5, 5, 5, 5, 6, 5, 5, 6, 5, 5, 5, 0, 5, 0, 5, 5, 6, 5, 6, 5, 0, 5, 5, 0, 6, 6, 5, 6, 0, 5, 0, 6, 6, 6, 6, 5, 5, 0, 5, 0, 0, 6, 0, 6, 5, 6, 5, 5, 0, 5, 6, 6, 5, 5, 0, 0, 6, 0, 0, 5, 0, 5, 6, 5, 5, 6, 6, 5, 5, 5, 5, 5, 6, 5, 6, 6, 0, 6, 6, 5, 5, 0, 0, 6, 6, 0, 5, 0, 0, 0, 0, 0, 5, 5, 6, 5, 5, 0, 5, 5, 0, 5, 5, 6, 5, 5, 5, 6, 5, 5, 5, 0, 0, 5, 5, 5, 5, 6, 0, 0, 6, 6, 6, 6, 5, 5, 5, 6, 5, 0, 5, 5, 6, 5, 6, 6, 5, 6, 6, 5, 0, 5, 5, 5, 5, 5, 0, 0, 0, 6, 5, 6, 6, 5, 6, 0, 6, 6, 5, 6, 6, 5, 5, 5, 0] skew:0.5525595985771928 skewtest:SkewtestResult(statistic=11.48409559793102, pvalue=1.585862990202731e-30)
count 2940.000000 mean 2.761224 std 0.706356 min 1.000000 25% 2.000000 50% 3.000000 75% 3.000000 max 4.000000 Name: WorkLifeBalance, dtype: float64 upperlimit: 4 lowerlimit: 1 outlier: count: 0 proportion: 0.0 list: [] skew:-0.5519163838185224 skewtest:SkewtestResult(statistic=-11.472257668483971, pvalue=1.8185197151995864e-30)
count 2940.000000 mean 7.008163 std 6.125483 min 0.000000 25% 3.000000 50% 5.000000 75% 9.000000 max 40.000000 Name: YearsAtCompany, dtype: float64 upperlimit: 18 lowerlimit: 0 outlier: count: 208 proportion: 0.0707482993197279 list: [25, 22, 22, 27, 21, 22, 37, 25, 20, 40, 20, 24, 20, 24, 33, 20, 19, 22, 33, 24, 19, 21, 20, 36, 20, 20, 22, 24, 21, 21, 25, 21, 29, 20, 27, 20, 31, 32, 20, 20, 21, 22, 22, 34, 24, 26, 31, 20, 31, 26, 19, 21, 21, 32, 21, 19, 20, 22, 20, 21, 26, 20, 22, 24, 33, 29, 25, 21, 19, 19, 20, 19, 33, 19, 19, 20, 20, 20, 20, 20, 32, 20, 21, 33, 36, 26, 30, 22, 23, 23, 21, 21, 22, 22, 19, 22, 19, 22, 20, 20, 20, 22, 20, 20, 25, 22, 22, 27, 21, 22, 37, 25, 20, 40, 20, 24, 20, 24, 33, 20, 19, 22, 33, 24, 19, 21, 20, 36, 20, 20, 22, 24, 21, 21, 25, 21, 29, 20, 27, 20, 31, 32, 20, 20, 21, 22, 22, 34, 24, 26, 31, 20, 31, 26, 19, 21, 21, 32, 21, 19, 20, 22, 20, 21, 26, 20, 22, 24, 33, 29, 25, 21, 19, 19, 20, 19, 33, 19, 19, 20, 20, 20, 20, 20, 32, 20, 21, 33, 36, 26, 30, 22, 23, 23, 21, 21, 22, 22, 19, 22, 19, 22, 20, 20, 20, 22, 20, 20] skew:1.7627284034822992 skewtest:SkewtestResult(statistic=27.448254444670862, pvalue=7.289045104875861e-166)
count 2940.000000 mean 4.229252 std 3.622521 min 0.000000 25% 2.000000 50% 3.000000 75% 7.000000 max 18.000000 Name: YearsInCurrentRole, dtype: float64 upperlimit: 14 lowerlimit: 0 outlier: count: 42 proportion: 0.014285714285714285 list: [15, 16, 18, 15, 18, 17, 16, 15, 16, 15, 16, 16, 15, 16, 17, 15, 15, 15, 17, 17, 16, 15, 16, 18, 15, 18, 17, 16, 15, 16, 15, 16, 16, 15, 16, 17, 15, 15, 15, 17, 17, 16] skew:0.9164268059808774 skewtest:SkewtestResult(statistic=17.50652076391757, pvalue=1.2776855008203247e-68)
count 2940.000000 mean 2.187755 std 3.221882 min 0.000000 25% 0.000000 50% 1.000000 75% 3.000000 max 15.000000 Name: YearsSinceLastPromotion, dtype: float64 upperlimit: 7 lowerlimit: 0 outlier: count: 214 proportion: 0.0727891156462585 list: [8, 15, 8, 8, 9, 13, 12, 10, 11, 9, 12, 15, 15, 15, 9, 11, 11, 9, 12, 11, 15, 11, 10, 9, 11, 9, 8, 11, 11, 8, 13, 9, 9, 12, 10, 11, 15, 13, 9, 11, 10, 8, 8, 11, 9, 11, 12, 11, 14, 13, 14, 8, 11, 15, 10, 11, 11, 15, 11, 13, 11, 13, 15, 8, 13, 15, 11, 14, 15, 15, 9, 11, 9, 8, 9, 15, 11, 12, 9, 8, 10, 14, 8, 13, 13, 12, 14, 8, 8, 8, 14, 14, 8, 12, 13, 14, 14, 12, 11, 8, 11, 9, 12, 8, 9, 11, 9, 8, 15, 8, 8, 9, 13, 12, 10, 11, 9, 12, 15, 15, 15, 9, 11, 11, 9, 12, 11, 15, 11, 10, 9, 11, 9, 8, 11, 11, 8, 13, 9, 9, 12, 10, 11, 15, 13, 9, 11, 10, 8, 8, 11, 9, 11, 12, 11, 14, 13, 14, 8, 11, 15, 10, 11, 11, 15, 11, 13, 11, 13, 15, 8, 13, 15, 11, 14, 15, 15, 9, 11, 9, 8, 9, 15, 11, 12, 9, 8, 10, 14, 8, 13, 13, 12, 14, 8, 8, 8, 14, 14, 8, 12, 13, 14, 14, 12, 11, 8, 11, 9, 12, 8, 9, 11, 9] skew:1.9822646234628944 skewtest:SkewtestResult(statistic=29.40325736147359, pvalue=4.989550260904439e-190)
count 2940.000000 mean 4.123129 std 3.567529 min 0.000000 25% 2.000000 50% 3.000000 75% 7.000000 max 17.000000 Name: YearsWithCurrManager, dtype: float64 upperlimit: 14 lowerlimit: 0 outlier: count: 28 proportion: 0.009523809523809525 list: [17, 15, 15, 15, 15, 17, 16, 17, 15, 17, 17, 17, 17, 16, 17, 15, 15, 15, 15, 17, 16, 17, 15, 17, 17, 17, 17, 16] skew:0.832600290620938 skewtest:SkewtestResult(statistic=16.23423821365354, pvalue=2.8881104371329895e-59)
Conclusion: the points below focus on the outliers and on the shape of each distribution (skew)
- 10 columns contain outliers. The outlier count for each column is:
  - MonthlyIncome: 228
  - NumCompaniesWorked: 104
  - PerformanceRating: 452
  - StockOptionLevel: 170
  - TotalWorkingYears: 126
  - TrainingTimesLastYear: 476
  - YearsAtCompany: 208
  - YearsInCurrentRole: 42
  - YearsSinceLastPromotion: 214
  - YearsWithCurrManager: 28
- 5 columns are negatively skewed, as shown by a negative skew value and a skewtest p-value below alpha: EnvironmentSatisfaction, JobInvolvement, JobSatisfaction, RelationshipSatisfaction, WorkLifeBalance
- 14 columns are positively skewed, as shown by a positive skew value and a skewtest p-value below alpha: Age, DistanceFromHome, JobLevel, MonthlyIncome, NumCompaniesWorked, PercentSalaryHike, PerformanceRating, StockOptionLevel, TotalWorkingYears, TrainingTimesLastYear, YearsAtCompany, YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager
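The outlier counts in this conclusion follow the 1.5 × IQR rule from task (1d). The rule can be sketched as below, assuming a hypothetical helper named `iqr_outlier_summary` and toy data that is not taken from employee.csv. This is an illustration only and not necessarily the code that produced the output above.

```python
import pandas as pd


def iqr_outlier_summary(series: pd.Series) -> dict:
    """Summarize outliers of one numeric column with the 1.5 * IQR rule.

    Hypothetical helper written for illustration.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # Whiskers end at the most extreme observations still inside the fences.
    inside = series[(series >= lower_fence) & (series <= upper_fence)]
    outliers = series[(series < lower_fence) | (series > upper_fence)]

    return {
        "lower_whisker": inside.min(),
        "upper_whisker": inside.max(),
        "count": len(outliers),
        "proportion": len(outliers) / len(series),
        "list": outliers.tolist(),
    }


# Toy data (an assumption for the example, not rows from employee.csv).
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],  # one extreme value
    "b": [3, 3, 3, 3, 3, 3, 3, 3, 4, 4],    # rating-like column with IQR = 0
})

summary_a = iqr_outlier_summary(toy["a"])
summary_b = iqr_outlier_summary(toy["b"])
print(summary_a)
print(summary_b)
```

Column `b` mirrors what happens with a rating-like column such as PerformanceRating: when Q1 and Q3 coincide, the IQR is 0 and every value different from that single level falls outside the fences, so a large outlier count there reflects the rule more than genuinely extreme values.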
(2) Perform univariate EDA for every categorical column in employee.csv, covering:
a. a countplot for each column
b. the list of unique categories and their frequencies for each column
c. anything you find interesting in the EDA results you obtain
employee_obj = pd_employee.select_dtypes(include = 'object')
employee_obj
| | Attrition | BusinessTravel | Department | EducationField | Gender | JobRole | MaritalStatus | Over18 | OverTime |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Yes | Travel_Rarely | Sales | Life Sciences | Female | Sales Executive | Single | Y | Yes |
| 1 | No | Travel_Frequently | Research & Development | Life Sciences | Male | Research Scientist | Married | Y | No |
| 2 | Yes | Travel_Rarely | Research & Development | Other | Male | Laboratory Technician | Single | Y | Yes |
| 3 | No | Travel_Frequently | Research & Development | Life Sciences | Female | Research Scientist | Married | Y | Yes |
| 4 | No | Travel_Rarely | Research & Development | Medical | Male | Laboratory Technician | Married | Y | No |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2935 | No | Travel_Frequently | Research & Development | Medical | Male | Laboratory Technician | Married | Y | No |
| 2936 | No | Travel_Rarely | Research & Development | Medical | Male | Healthcare Representative | Married | Y | No |
| 2937 | No | Travel_Rarely | Research & Development | Life Sciences | Male | Manufacturing Director | Married | Y | Yes |
| 2938 | No | Travel_Frequently | Sales | Medical | Male | Sales Executive | Married | Y | No |
| 2939 | No | Travel_Rarely | Research & Development | Medical | Male | Laboratory Technician | Married | Y | No |
2940 rows × 9 columns
for x in employee_obj.columns:
    # Pass the column via x= so the call works across seaborn versions
    sns.countplot(x=employee_obj[x])
    plt.xticks(rotation=45)
    plt.show()
Conclusion:
- Several columns have imbalanced categories, for example:
  - Attrition, where No is far more frequent than Yes
  - OverTime, where No is far more frequent than Yes
- The Over18 column shows that every employee is over 18 years old
- Most employees are or have been married (Married and Divorced), as seen in the MaritalStatus column
- The most common job role in the company is Sales Executive, as seen in the JobRole column
- Most employees are male, as seen in the Gender column
- Most employees rarely travel for work, as seen in the BusinessTravel column
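Task (2b) asks for the unique categories and their frequencies, which the countplots show only visually. A minimal sketch of how those numbers could be tabulated follows, assuming a hypothetical helper named `category_frequencies` and a toy frame that is not taken from employee.csv.

```python
import pandas as pd


def category_frequencies(frame: pd.DataFrame) -> dict:
    """Return one count/proportion table per column of a categorical frame.

    Hypothetical helper written for illustration.
    """
    tables = {}
    for col in frame.columns:
        counts = frame[col].value_counts()
        tables[col] = pd.DataFrame({
            "count": counts,
            "proportion": counts / len(frame),
        })
    return tables


# Toy frame (an assumption for the example, not rows from employee.csv).
toy_obj = pd.DataFrame({
    "Attrition": ["No", "No", "No", "Yes"],
    "Over18": ["Y", "Y", "Y", "Y"],
})

freq = category_frequencies(toy_obj)
for name, table in freq.items():
    print(name)
    print(table)
```

Applied to a frame such as `employee_obj`, the proportion column would put a number on the imbalance noted for Attrition and OverTime, and a column with a single row in its table (as Over18 has here) carries no information for later analysis.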
(3) Perform multivariate EDA for every pair of numeric columns in employee.csv, covering:
a. scatterplots between the numeric columns with the 'Attrition' column as hue
b. anything you find interesting in the EDA results you obtain
sns.pairplot(pd_employee, hue='Attrition')
<seaborn.axisgrid.PairGrid at 0x1f245be7550>
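A pairplot over every numeric column also draws panels for identifier columns (such as `Unnamed: 0`) and for constant columns (StandardHours has a standard deviation of 0), none of which can show a relationship. One possible way to narrow the column list first is sketched below, assuming a hypothetical helper named `informative_numeric_columns` and a toy frame built for the example.

```python
import pandas as pd


def informative_numeric_columns(frame: pd.DataFrame, id_columns=()) -> list:
    """List numeric columns that vary and are not identifiers.

    Hypothetical helper written for illustration; id_columns must be
    supplied by the caller because identifiers cannot be detected reliably.
    """
    numeric = frame.select_dtypes(include="number")
    return [
        col for col in numeric.columns
        if col not in id_columns and numeric[col].nunique() > 1
    ]


# Toy frame (an assumption for the example, not the full employee.csv).
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2, 3],
    "Age": [41, 49, 37, 33],
    "StandardHours": [80, 80, 80, 80],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

cols = informative_numeric_columns(toy, id_columns=("Unnamed: 0",))
print(cols)
```

The resulting list can then be passed to the `vars` parameter of `sns.pairplot`, for example `sns.pairplot(pd_employee, vars=cols, hue='Attrition')`, which keeps the grid smaller and faster to render.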